In today’s lab we will walk through the steps of building a spatial interaction model of commuting flows between London boroughs, using population size and median salary information. The code in this lab is modified from an example given by Dr. Adam Dennett of UCL’s Centre for Advanced Spatial Analysis. You will need a set of R packages to carry out all parts of this lab, so make sure these are installed, then load them:
library(sf)
library(MASS)
library(reshape2)
library(maptools)
library(dplyr)
library(stplanr)
library(ggplot2)
First, let’s read in the commuting flow data. This is taken from the 2001 UK Census, downloaded from https://wicid.ukdataservice.ac.uk/.
cdata = read.csv("../datafiles/LondonCommuting2001.csv")
head(cdata)
## Orig OrigCode Dest DestCode Total WorksFromHome
## 1 City of London 00AA City of London 00AA 2059 432
## 2 City of London 00AA Barking and Dagenham 00AB 6 0
## 3 City of London 00AA Barnet 00AC 14 0
## 4 City of London 00AA Bexley 00AD 0 0
## 5 City of London 00AA Brent 00AE 16 0
## 6 City of London 00AA Bromley 00AF 0 0
## Underground Train Bus Taxi CarDrive CarPass Motobike Bicycle Walk Other
## 1 120 53 50 31 39 0 3 18 1272 41
## 2 3 3 0 0 0 0 0 0 0 0
## 3 11 0 0 0 0 0 0 0 3 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 10 0 3 0 0 0 0 0 3 0
## 6 0 0 0 0 0 0 0 0 0 0
## OrigCodeNew DestCodeNew OrigPop OrigSal DestPop DestSal
## 1 E09000001 E09000001 12000 38300 12000 38300
## 2 E09000001 E09000002 12000 38300 56000 16200
## 3 E09000001 E09000003 12000 38300 159000 18700
## 4 E09000001 E09000004 12000 38300 112000 18300
## 5 E09000001 E09000005 12000 38300 127000 16500
## 6 E09000001 E09000006 12000 38300 164000 19100
Looking at the first few lines of the file, you can see that each line gives the commuting flow between one origin (Orig) and one destination (Dest). Commuting flows are given as a total flow (Total), and broken down by different modes of transportation. In the last few columns, population and income variables have been added for each origin-destination pair. There are a total of 1089 origin-destination pairs.
nrow(cdata)
## [1] 1089
Next, read in and plot a shapefile of the London boroughs using the st_read() function:
LondonBNG = st_read("../datafiles/LondonBNG/London.shp")
## Reading layer `London' from data source
## `/Users/u0784726/Dropbox/Data/devtools/geog5160/datafiles/LondonBNG/London.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 33 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 503576.3 ymin: 155850.7 xmax: 561958.7 ymax: 200934
## Projected CRS: Transverse_Mercator
plot(st_geometry(LondonBNG))
Before we get to modeling the flow data, there are a couple of pre-processing steps that we need to undertake.
First, calculate a matrix of distances between all pairs of boroughs (this will be the \(d_{ij}\) term in our gravity model). For this, we use the function st_distance() between the borough centroids (extracted with st_centroid()):
dist <- st_distance(st_centroid(LondonBNG))
## Warning in st_centroid.sf(LondonBNG): st_centroid assumes attributes are
## constant over geometries of x
dist <- units::drop_units(dist)
dist[1:10,1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.000 16021.039 13906.457 17310.372 13119.231 18774.436 5725.908
## [2,] 16021.039 0.000 25075.425 9629.985 27872.037 20102.098 20178.865
## [3,] 13906.457 25075.425 0.000 29942.769 7543.571 32664.101 8561.550
## [4,] 17310.372 9629.985 29942.769 0.000 30428.865 11465.489 22852.272
## [5,] 13119.231 27872.037 7543.571 30428.865 0.000 30386.018 7775.904
## [6,] 18774.436 20102.098 32664.101 11465.489 30386.018 0.000 24234.649
## [7,] 5725.908 20178.865 8561.550 22852.272 7775.904 24234.649 0.000
## [8,] 17749.005 26110.225 30228.613 19571.845 25859.849 9836.842 21806.562
## [9,] 16595.005 32325.347 13363.359 33488.710 5940.999 31423.634 12331.705
## [10,] 15150.050 19294.077 9347.132 26574.560 16193.111 32501.537 12613.023
## [,8] [,9] [,10]
## [1,] 17749.005 16595.005 15150.050
## [2,] 26110.225 32325.347 19294.077
## [3,] 30228.613 13363.359 9347.132
## [4,] 19571.845 33488.710 26574.560
## [5,] 25859.849 5940.999 16193.111
## [6,] 9836.842 31423.634 32501.537
## [7,] 21806.562 12331.705 12613.023
## [8,] 0.000 25161.818 32891.779
## [9,] 25161.818 0.000 22128.305
## [10,] 32891.779 22128.305 0.000
Next, we need to transform this for use with the commuting flow data. Here, we use the function melt() from the reshape2 package. This is a neat little function that takes a matrix or other wide-format data and melts it down into a long format, where the first two columns refer to a row-column pair from the distance matrix, and the third gives the distance between them.
distPair <- melt(dist, varnames = c("Dest","Orig"), value.name = "dist")
head(distPair)
## Dest Orig dist
## 1 1 1 0.00
## 2 2 1 16021.04
## 3 3 1 13906.46
## 4 4 1 17310.37
## 5 5 1 13119.23
## 6 6 1 18774.44
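To see what melt() does in isolation, here is a toy example on a small symmetric matrix (assumed values, not the lab distances):

```r
library(reshape2)

# a small symmetric "distance" matrix between three zones
m_toy <- matrix(c(0, 5, 9,
                  5, 0, 4,
                  9, 4, 0), nrow = 3)
melt(m_toy, varnames = c("Dest", "Orig"), value.name = "dist")
# one row per (Dest, Orig) pair; e.g. the distance between zones 1 and 2 is 5
```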
Now add the distance column back into the data frame, and take a look at the modified data frame:
cdata$dist <- distPair$dist
head(cdata)
Next we will swap the order of some of the columns. This is not important for modeling, but will allow us to visualize the flow data. For visualization, we will use the od2line function from the stplanr package. This takes a set of O/D pairs and generates a spatial network between them, by extracting the coordinates of each pair from a spatial object. It requires that the first two columns of the O/D data frame are the codes of the origin and destination that match the spatial locations in the shapefile. There are a couple of ways of doing this, but to make our lives as easy as possible, we’ll use the select() function from the dplyr package:
cdata2 <- dplyr::select(cdata, OrigCodeNew, DestCodeNew, Total, everything())
cdata2$Orig <- as.factor(cdata2$Orig)
cdata2$Dest <- as.factor(cdata2$Dest)
And finally, we remove the intra-borough flows. As the distances generated by st_distance() for these are zero, they can cause problems in the models we build. They could be included by setting the internal borough travel distance to some positive value, but to make our lives easier, we will just exclude them here and focus on the inter-borough flows.
cdata2 <- cdata2[cdata2$OrigCode!=cdata2$DestCode,]
With that done, we can visualize the flow data using the od2line function from the stplanr package. This needs a set of pairs of locations (from cdata) and a shapefile with the spatial geometry of the individual locations (the shapefile we loaded earlier). With these, it will create a network between locations:
travel_network <- od2line(flow = cdata2, zones = LondonBNG)
## Creating centroids representing desire line start and end points.
To get some sense of the size of the flows, we can use the total flow to set the line widths of the network links (as a weighting factor w), and then plot the network:
w <- cdata2$Total / max(cdata2$Total) * 10
plot(st_geometry(LondonBNG))
plot(st_geometry(travel_network), lwd = w, add=TRUE)
Now we can create a pivot table to turn the paired list into a matrix (and compute the margins as well):
cdata2mat <- dcast(cdata2, Orig ~ Dest, sum,
value.var = "Total", margins=c("Orig", "Dest"))
cdata2mat[1:10,1:10]
## Orig Barking and Dagenham Barnet Bexley Brent Bromley Camden
## 1 Barking and Dagenham 0 194 96 178 66 1500
## 2 Barnet 96 0 34 5467 76 12080
## 3 Bexley 362 132 0 144 4998 2470
## 4 Brent 40 6124 28 0 66 8105
## 5 Bromley 134 162 3199 201 0 3780
## 6 Camden 36 1496 32 1350 60 0
## 7 City of London 6 14 0 16 0 335
## 8 Croydon 85 204 300 329 5152 3248
## 9 Ealing 30 967 33 5263 91 4547
## 10 Enfield 217 5642 52 1038 76 5588
## City of London Croydon Ealing
## 1 3641 104 188
## 2 7709 148 1573
## 3 6580 710 188
## 4 4145 187 6703
## 5 9855 6268 293
## 6 8795 147 643
## 7 0 3 9
## 8 5925 0 405
## 9 4855 182 0
## 10 5212 136 626
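If the dcast() call is unfamiliar, a toy version (assumed values, not the lab data) shows how the long table is pivoted and how the "(all)" margins are formed:

```r
library(reshape2)

# a minimal O/D table with two zones
od_toy <- data.frame(Orig = c("A", "A", "B", "B"),
                     Dest = c("A", "B", "A", "B"),
                     Total = c(0, 10, 20, 0))
dcast(od_toy, Orig ~ Dest, sum, value.var = "Total",
      margins = c("Orig", "Dest"))
# the "(all)" row and column hold the marginal sums; the grand total is 30
```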
Now let’s produce a gravity model based on this flow data. The equation for the classic unconstrained gravity model is:
\[ T_{ij}=k V_i^\mu W_j^\alpha d_{ij}^\beta \]
It’s perfectly possible to produce some flow estimates by plugging arbitrary or expected values in for the parameters. The parameters relate to the scaling effect, or importance, of the variables they are associated with.
A starting assumption may be that the effects of origin and destination attributes on flows scale in a linear fashion (i.e. a 1 unit increase in the population at an origin results in a 1 unit increase in flows of people from that origin, and halving the average salary at a destination will halve the inflow of commuters). This would imply that both \(\mu\) and \(\alpha\) are set to 1. For \(\beta\), we will use a value of -2, suggesting that commuting flow follows a power law with respect to distance.
With these coefficients, we can start to make our flow estimates. To keep this as clear as possible, we’ll estimate each term in the gravity model equation individually.
mu <- 1
vi1_mu <- cdata2$OrigPop^mu
alpha <- 1
wj2_alpha <- cdata2$DestSal^alpha
beta <- -2
dist_beta <- cdata2$dist^beta
Finally, we need an estimate of \(k\), the proportionality constant. As the name suggests, this is a constant, and is used to avoid over- or under-estimating the total number of flows. When we have some information about the expected flow data, we can estimate \(k\) based on the total flows in the region and the equation given above:
\[ k = \frac{T}{\sum_i \sum_j V_i^\mu W_j^\alpha d_{ij}^\beta} \]
Where \(T\) is the total number of all commuting flows or trips in the dataset. Effectively, this then scales the amount of flows based on population and salary alone to the observed number of trips, which has the effect of making the predicted value of the total number of trips (\(\sum \hat{T}_{ij}\)) correct. Note that if we don’t know the total flow data, \(k\) is usually taken from previous studies or by trial and error. (While we refer to this model as an unconstrained gravity model, the use of \(\sum T_{ij}\) technically makes this a total constrained model.)
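The role of \(k\) can be sketched with toy numbers (assumed values, not the lab data): whatever the scale of the population, salary, and distance terms, dividing the total observed flow by their summed product guarantees the predicted total matches the observed total.

```r
# toy values for four O/D pairs (assumed, for illustration only)
Tij_toy <- c(100, 40, 60, 30)             # observed flows
Vi_toy  <- c(5000, 5000, 8000, 8000)      # origin populations
Wj_toy  <- c(20000, 15000, 20000, 15000)  # destination salaries
dij_toy <- c(10000, 20000, 15000, 25000)  # distances (m)

# k rescales the raw gravity terms to the observed total flow
k_toy <- sum(Tij_toy) / sum(Vi_toy * Wj_toy * dij_toy^-2)
That_toy <- k_toy * Vi_toy * Wj_toy * dij_toy^-2
sum(That_toy)   # equals sum(Tij_toy) = 230 by construction
```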
k = sum(cdata2$Total) / sum(vi1_mu * wj2_alpha * dist_beta)
We can now use the gravity model to estimate the flows by simply multiplying all this together, and store this back in the original data frame:
cdata2$unconstrainedEst1 <- round(k * vi1_mu * wj2_alpha * dist_beta, 0)
To make sure that \(k\) is doing its job correctly, calculate the observed total number of flows:
sum(cdata2$Total)
## [1] 1800413
And compare this to the predicted total:
sum(cdata2$unconstrainedEst1)
## [1] 1800411
If you want to see the results as an O/D matrix, we can use the dcast() function to turn the set of inter-borough flows into a matrix.
#turn it into a little matrix and have a look at your handy work
cdata2mat1 <- dcast(cdata2, Orig ~ Dest, sum,
value.var = "unconstrainedEst1", margins=c("Orig", "Dest"))
And if we look at a part of this matrix, we can see the resulting predicted flows:
cdata2mat1[1:10,1:10]
## Orig Barking and Dagenham Barnet Bexley Brent Bromley Camden
## 1 Barking and Dagenham 0 178 1180 127 283 291
## 2 Barnet 438 0 347 4925 304 4588
## 3 Bexley 2090 250 0 213 1738 454
## 4 Brent 283 4458 268 0 281 4442
## 5 Bromley 702 307 2439 313 0 591
## 6 Camden 429 2752 378 2944 351 0
## 7 City of London 81 124 78 123 69 774
## 8 Croydon 388 334 781 403 3226 681
## 9 Ealing 234 1577 246 7041 291 1961
## 10 Enfield 595 2927 354 860 247 1702
## City of London Croydon Ealing
## 1 893 161 100
## 2 3364 340 1664
## 3 1529 572 187
## 4 3019 371 6726
## 5 1904 3313 310
## 6 12603 415 1242
## 7 0 74 81
## 8 1987 0 452
## 9 2095 435 0
## 10 2282 231 489
For a small set of locations, we can take a look at this matrix and compare it to known flows (the cdata2mat matrix shown above). For larger datasets, this quickly becomes impractical. Instead, let’s calculate the root mean squared error (RMSE) between the observed and predicted flows.
As a reminder, the RMSE is given by the following equation, and is basically an estimate of the typical error in the model estimates. The closer the RMSE is to 0, the better the model.
\[ RMSE = \sqrt{\frac{\sum_i(y_i-\hat{y}_i)^2}{n}} \]
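For convenience, this can be wrapped in a small helper function (a sketch in base R; the lab code computes it inline):

```r
# root mean squared error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# quick check with made-up numbers: errors of -2, 2 and -3
rmse(c(10, 20, 30), c(12, 18, 33))   # sqrt(17/3), about 2.38
```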
sqrt(mean((cdata2$Total-cdata2$unconstrainedEst1)^2))
## [1] 3179.464
The traditional way to improve the model fit has been by gradually changing the set of parameters, and rebuilding the model until the RMSE is as low as possible. However, we can also do this by regression. If we log-transform the gravity model given above, we get the following equation:
\[ \mbox{log} T_{ij}=\mbox{log}(k) + \mu \mbox{log}(V_i) + \alpha \mbox{log} (W_j) + \beta \mbox{log} (d_{ij}) \]
Which you should recognize as a basic linear regression model, where the intercept gives us an estimate of \(\mbox{log}(k)\), and the slope coefficients give us estimates of \(\mu\), \(\alpha\) and \(\beta\), respectively.
To get a sense of what this is trying to do, let’s make a plot of the untransformed flow data against the individual OD distances.
plot(cdata$dist, cdata$Total,
main="London commuting flow data", xlab="Dist (dij)", ylab="Flow (Tij)")
The relationship between flow and distance is clearly not linear. If we do the same, but log-transform both variables, we get the following:
plot(log(cdata$dist), log(cdata$Total),
main="London commuting flow data",
xlab="Dist (dij)", ylab="Flow (Tij)")
And while there is a lot of scatter, we can see a linearly declining relationship: fewer people travel over longer distances. If we could estimate the slope of this decline, this would give us the estimate of \(\beta\).
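As a quick illustration of this idea on synthetic data (with an assumed true slope, not the lab flows), an ordinary least squares fit to log-transformed values recovers the decay exponent:

```r
# synthetic log-log data with a known slope of -2
set.seed(7)
logd_toy <- runif(200, 8, 11)                        # log distances
logT_toy <- 10 - 2 * logd_toy + rnorm(200, sd = 0.5) # log flows plus noise
coef(lm(logT_toy ~ logd_toy))[["logd_toy"]]          # close to -2
```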
While it is entirely feasible to fit this using a regular linear regression, based on ordinary least squares (OLS), this approach has been criticized as being unsuitable for modeling flow data. OLS models are designed for use with continuous, unbounded outcome variables (i.e. something that can potentially take any value from negative to positive infinity). Flow data cannot be negative: a flow is bounded below by zero. Also, we are generally dealing with discrete units: a person, a car, etc. A better approach is based on Poisson regression, a form of generalized linear model (GLM). This models the flows directly as untransformed counts, using the following equation:
\[ T_{ij}=\mbox{exp}( \mbox{log}(k) + \mu \mbox{log}(V_i) + \alpha \mbox{log} (W_j) + \beta \mbox{log} (d_{ij}) ) \]
All this has done is to eliminate the log term on the left hand side and replace it with an exponent on the right hand side. This means that we model the untransformed counts, which by definition are non-negative (bounded at zero) and integer values. This has some further advantages in how the model treats the errors, which make this approach more robust.
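Before fitting this to the London data, a quick sanity check on synthetic data (toy values, not the lab flows) shows that a Poisson GLM of this form can recover a known distance-decay exponent:

```r
# simulate counts with a known decay exponent of -1.5
set.seed(42)
dd <- runif(500, 1, 50)                        # toy distances
toyflow <- rpois(500, exp(8 - 1.5 * log(dd)))  # Poisson counts, true beta = -1.5
toyfit <- glm(toyflow ~ log(dd), family = poisson(link = "log"))
coef(toyfit)[["log(dd)"]]                      # close to -1.5
```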
To fit this in R, we use the glm() function, which can be used to fit a GLM. Note that we need to specify that this is a Poisson model with the family = poisson() argument. The syntax to fit this for the London flow data is:
uncosim <- glm(Total ~ log(OrigPop)+log(DestSal)+log(dist),
family = poisson(link = "log"), data = cdata2)
Note that we log transform the covariates (population, etc). Now let’s look at the output:
summary(uncosim)
##
## Call:
## glm(formula = Total ~ log(OrigPop) + log(DestSal) + log(dist),
## family = poisson(link = "log"), data = cdata2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -148.329 -26.953 -16.723 1.279 189.270
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.522966 0.047412 -264.1 <2e-16 ***
## log(OrigPop) 1.622671 0.003130 518.4 <2e-16 ***
## log(DestSal) 1.549826 0.002979 520.2 <2e-16 ***
## log(dist) -1.498169 0.001268 -1181.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3374332 on 1055 degrees of freedom
## Residual deviance: 1413323 on 1052 degrees of freedom
## AIC: 1421746
##
## Number of Fisher Scoring iterations: 5
So what does all this tell us? The most important part is the table of coefficients, which gives the estimates of \(\mbox{log}(k)\) (the intercept), \(\mu\) (log(OrigPop)), \(\alpha\) (log(DestSal)), and \(\beta\) (log(dist)).
The final column of this table gives an estimate of the significance of each coefficient (Pr(>|z|)). The lower this is, the higher the significance in explaining the flow data. Here all the coefficients we have used are highly significant.
Comparing these to the original simple model above, we see that the first model underestimated the coefficients for emission (\(\mu\)) and attraction (\(\alpha\)). As both of these are above 1, it suggests that as population and salary increase in an area, commuting flows increase more than proportionally. The magnitude of the distance coefficient (\(\beta\)) was overestimated (about -1.5 rather than -2), but its absolute value is still above 1, indicating that travel costs rise more than proportionally with distance.
As before, we can calculate the RMSE for this model to compare against the simple model. To do this we use the fitted() function to get the predicted flows for each pair of locations, then round to whole numbers.
cdata2$unconstrainedEst2 = round(fitted(uncosim), 0)
head(cdata2$unconstrainedEst2)
## [1] 25 39 27 35 26 162
Now reuse the code from above to calculate the RMSE.
## Model 1
sqrt(mean((cdata2$Total-cdata2$unconstrainedEst1)^2))
## [1] 3179.464
## Model 2
sqrt(mean((cdata2$Total-cdata2$unconstrainedEst2)^2))
## [1] 2331.069
Which shows that, for our calibrated model, the RMSE has dropped by approximately 27%.
And calculate the total estimated flow:
sum(cdata2$unconstrainedEst2)
## [1] 1800399
Which matches well to the total observed flow:
sum(cdata2$Total)
## [1] 1800413
This is unsurprising: a Poisson regression with an intercept reproduces the total of the observed counts, with the fitted intercept playing the role of \(\mbox{log}(k)\). Next convert your estimates back to an O/D matrix and look at the estimates:
cdata2mat2 <- dcast(cdata2, Orig ~ Dest, sum, value.var = "unconstrainedEst2", margins=c("Orig", "Dest"))
cdata2mat2[1:10,1:10]
While the total is well constrained, the margins of the matrix are not. For example, we can compare the estimates of total outflow from each origin by looking at the row totals of the observed O/D matrix, and the new modeled one. If we exclude the extreme values, the predicted totals appear somewhat biased.
totoutflowObs = cdata2mat[,"(all)"]
totoutflowPred = cdata2mat2[,"(all)"]
plot(totoutflowObs, totoutflowPred, log="xy",
xlab="Observed outflow", ylab="Predicted outflow")
abline(0,1)
And we can do the same for the inflows. To do this, we need to extract the relevant row, which is a little more complex.
totinflowObs = as.numeric(cdata2mat[cdata2mat$Orig=="(all)", -1])
totinflowPred = as.numeric(cdata2mat2[cdata2mat2$Orig=="(all)", -1])
plot(totinflowObs, totinflowPred, log="xy",
xlab="Observed inflow", ylab="Predicted inflow")
abline(0,1)
Now, let’s recycle the code from above to plot the predicted flows:
w <- cdata2$unconstrainedEst2 / max(cdata2$unconstrainedEst2) * 10
plot(st_geometry(LondonBNG))
plot(st_geometry(travel_network), lwd = w, add=TRUE)
We can also plot the model errors on the same network. To do this, we first calculate the residuals, then translate them to network weights, using a somewhat arbitrary weighting. Finally, we plot the absolute values, color coded by whether the errors are positive (red) or negative (blue).
r <- cdata2$unconstrainedEst2 - cdata2$Total
w <- abs(r) / sum(abs(r)) * 250
err <- ifelse(r > 0, 2, 4)
plot(st_geometry(LondonBNG))
plot(st_geometry(travel_network), lwd = w, add=TRUE, col=err)
The work here is designed to show that it is relatively easy to build a range of spatial interaction models, from arbitrary hand-picked coefficients, to more complex, constrained models. Placing these in the framework of Poisson regression allows a fairly straightforward way to build the models, but also offers the possibility of extending these models by including more variables that may be relevant to attracting or emitting flows between locations.
It should also be possible to see from this that the same approach could be applied to more specific flows (the London commuting file contains information about different transport modes, for example), or extended to include different ‘cost’ information in place of the distance variable. Other improvements include modeling intra-regional flows as well as flows between regions, and the use of more complex models (e.g. negative binomial) to deal with some of the limitations of the Poisson model.
The CSV file aussie_flow.csv contains 5-year migration data between Australian Greater Capital City Statistical Areas (GCCSA). The file contains the following columns
Origin: Origin GCCSA
Orig_code: Origin GCCSA code
Destination: Destination GCCSA
Dest_code: Destination GCCSA code
Flow: Migration flow (including intra-GCCSA flow)
vi1_origpop: origin population size
wj1_destpop: destination population size
vi2_origunemp: origin unemployment rate
wj2_destunemp: destination unemployment rate
vi3_origmedinc: origin median income
wj3_destmedinc: destination median income
vi4_origpctrent: origin percent rentals
wj4_destpctrent: destination percent rentals
FlowNoIntra: Migration flow (excluding intra-GCCSA flow)
offset: constant offset
dist: distance between GCCSA centroids (km)
Using these data, build two unconstrained (not constrained!) gravity models of migration flow, using the origin population as the emission factor and destination income as the attraction factor. As the file also contains intra-zone migration, you will need to remove those flows before estimating the models, by finding all observations with the same origin and destination:
ausdata <- read.csv("../datafiles/aussie_flow.csv")
ausdata <- ausdata[ausdata$Orig_code != ausdata$Dest_code,]
mu <- 1
vi1_mu <- ausdata$vi1_origpop^mu
alpha <- 1
wj3_alpha <- ausdata$wj3_destmedinc^alpha
beta <- -2
dist_beta <- ausdata$dist^beta
k <- sum(ausdata$Flow) / sum(vi1_mu * wj3_alpha * dist_beta)
Then use the glm() function to fit a Poisson model of the flow data as a function of origin population and destination income, and get a summary() of the model.
THIS IS AN OPTIONAL PART OF THE LAB
In the previous step, we estimated an unconstrained model. While here we were able to calibrate this using the full set of flows (as these were available), this model can also be fit to partial, incomplete O/D flow information. As we have the full O/D matrix, we can improve this model by using extra information to constrain it.
If we use the row (origin) totals, we get a production constrained model, e.g. if we have information on the amount of disposable income and shopping habits of the people living in different areas from loyalty card data
If we use the column (destination) totals, we get an attraction constrained model, e.g. if we have information on new infrastructure developments such as a new airport, and wish to understand the impact on transportation flows
If we use both row and column totals, we get the doubly constrained model
This model is given by
\[ T_{ij} = A_i O_i W_j^\alpha d_{ij}^\beta \]
Where \(O_i\) represents the total outflow from zone \(i\) \[ O_i=\sum_j T_{ij} \]
And \(A_i\) is a balancing factor to account for different levels of productivity. To model this, we just rewrite the equation of the Poisson model to allow estimates of this factor:
\[ T_{ij}=\mbox{exp}( k + \mu_i + \alpha \mbox{log} (W_j) + \beta \mbox{log} (d_{ij}) ) \]
Here the coefficient \(\mu_i\) is a vector of balancing factors for each zone. In a model, this is just a set of dummy variables, which we represent here by using the Borough name. In R, the syntax for this model is given as:
prodSim <- glm(Total ~ Orig + log(DestSal) + log(dist),
family = poisson(link = "log"), data = cdata2)
summary(prodSim)
The main difference between this and the previous model is that we no longer have a single estimate of \(\mu\), but instead have a set of coefficients, one for each origin. You might see that the first location (the borough of Barking and Dagenham) is missing. This is used as the reference location, and the remaining coefficients indicate how important the other locations are as origins for commuters.
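If the origin factor in this model is unfamiliar, model.matrix() shows how R turns a factor into a set of 0/1 dummy variables (toy zone labels, not the lab data):

```r
# a toy factor with three levels
z <- factor(c("A", "B", "C", "B"))
model.matrix(~ z)
# the first level ("A") is absorbed into the intercept as the reference;
# columns zB and zC are 0/1 indicators for the other levels
```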
We might wish to have a less arbitrary choice of reference location. For London, the City of London (shown on the map below) can be assumed to be the main destination for commuters.
plot(st_geometry(LondonBNG))
plot(st_geometry(LondonBNG[LondonBNG$lad15nm=="City of London",]), add=TRUE, col=2)
To set this as the reference level in R, use the following code:
cdata2$Orig2 = relevel(cdata2$Orig, ref = "City of London")
And let’s use this to build a second model with the City of London as reference:
prodSim2 <- glm(Total ~ Orig2 + log(DestSal) + log(dist),
family = poisson(link = "log"), data = cdata2)
summary(prodSim2)
Now all these coefficients (\(\mu_i\)) are relative to the City of London. Let’s map them out to see what the spatial pattern looks like. Here, we use the coef() function to extract them, and add them to the shapefile. Note that we need to remove the first and last two coefficients (the intercept and the \(\alpha\) and \(\beta\) slopes).
modcoef = coef(prodSim2)[-c(1, 34, 35)]
LondonBNG$prodcoef = c(0, modcoef)
And then plot it, to show how the origin coefficients, and hence commuting outflows, increase as we move further away from the center toward the suburbs.
plot(LondonBNG["prodcoef"])
Now let’s explore the predicted flows. First, get the flow for each O/D pair from the original model:
cdata2$prodsimFitted <- round(fitted(prodSim), 0)
Then create an O/D matrix, with the marginal sums of total outflow (rows) and inflows (columns):
cdata2mat3 <- dcast(cdata2, Orig ~ Dest, sum, value.var = "prodsimFitted",
margins=c("Orig", "Dest"))
cdata2mat3
If we now redo the comparison of the outflow values, we can see that the constraint has worked and the outflows match.
totoutflowObs = cdata2mat[, "(all)"]
totoutflowPred = cdata2mat3[, "(all)"]
plot(totoutflowObs, totoutflowPred, log="xy",
xlab = "Observed outflow", ylab = "Predicted outflow")
abline(0, 1)
Including the constraint has also improved the model RMSE, showing some benefit of incorporating this extra information in the model.
sqrt(mean((cdata2$Total - cdata2$prodsimFitted)^2))
## [1] 2204.673
The attraction constrained model uses information about the total inflow to each region to constrain the model:
\[ T_{ij} = D_j B_j V_i^\mu d_{ij}^\beta \]
Where \(D_j\) represents the total inflow to zone \(j\) \[ D_j=\sum_i T_{ij} \]
And \(B_j\) is a balancing factor to account for different levels of attraction. We make a similar modification to the Poisson model to accommodate this:
\[ T_{ij}=\mbox{exp}( k + \mu \mbox{log} (V_i) + \alpha_j + \beta \mbox{log} (d_{ij}) ) \]
Now we have a vector of dummy variables (\(\alpha_j\)), one per zone, representing the level of attractiveness for commuting. In R, we simply include the destinations rather than the origins. Again, we’ll set the City of London as the reference destination:
cdata2$Dest2 = relevel(cdata2$Dest, ref = "City of London")
attrSim <- glm(Total ~ Dest2 + log(OrigPop) + log(dist),
family = poisson(link = "log"), data = cdata2)
summary(attrSim)
coef(attrSim)
The output again contains a single coefficient per region (minus the reference region), but these now indicate how much more (or less) likely each region is to be a destination for commuters.
modcoef = coef(attrSim)[-c(1,34,35)]
LondonBNG$attrcoef = c(0, modcoef)
And then plot it, to show the spatial pattern of relative destination attractiveness across the boroughs.
plot(LondonBNG["attrcoef"])
Now let’s get the flow for each O/D pair, and compare to the observations:
cdata2$attrsimFitted <- round(fitted(attrSim),0)
Create an O/D matrix, with the marginal sums of total outflow (rows) and inflows (columns):
cdata2mat4 <- dcast(cdata2, Orig ~ Dest, sum, value.var = "attrsimFitted",
margins=c("Orig", "Dest"))
cdata2mat4
Now redo the comparison of the inflow values:
totinflowObs = as.numeric(cdata2mat[cdata2mat$Orig=="(all)", -1])
totinflowPred = as.numeric(cdata2mat4[cdata2mat4$Orig=="(all)", -1])
plot(totinflowObs, totinflowPred, log = "xy",
xlab = "Observed inflow", ylab = "Predicted inflow")
abline(0,1)
And calculate the RMSE, which shows that including the destination constraint substantially improves the model:
sqrt(mean((cdata2$Total-cdata2$attrsimFitted)^2))
## [1] 1681.193
Of course, the next obvious step is to combine these two constraints in the doubly constrained model:
\[ T_{ij} = A_i O_i D_j B_j d_{ij}^\beta \]
Estimating this by hand is somewhat difficult, as the value of \(A_i\) depends on \(B_j\) and vice versa, so it requires a set of iterations to converge on the best estimates. In our case, however, there is no need to do this: we can just run a Poisson regression with the full set of factors (origins and destinations).
doubSim <- glm(Total ~ Orig + Dest + log(dist),
family = poisson(link = "log"), data = cdata2)
#let's have a look at its summary...
summary(doubSim)
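A useful property of Poisson regression with factor covariates is that the fitted totals for each factor level match the observed totals; this is exactly what "constraining" the margins means here. A quick check on synthetic data (toy zones, not the lab flows):

```r
# toy O/D data for three zones
set.seed(1)
orig_toy <- factor(rep(c("A", "B", "C"), each = 3))
dest_toy <- factor(rep(c("A", "B", "C"), times = 3))
dij_toy <- runif(9, 1, 10)
flow_toy <- rpois(9, exp(3 + as.numeric(orig_toy) / 2 - 0.5 * log(dij_toy)))

# Poisson model with origin and destination factors
fit_toy <- glm(flow_toy ~ orig_toy + dest_toy + log(dij_toy),
               family = poisson(link = "log"))

# fitted outflow totals per origin equal the observed ones
cbind(observed = tapply(flow_toy, orig_toy, sum),
      fitted = tapply(fitted(fit_toy), orig_toy, sum))
```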
As before, get the estimated flow for each O/D pair:
cdata2$doubsimFitted <- round(fitted(doubSim),0)
Then create an O/D matrix, with the marginal sums of total outflow (rows) and inflows (columns):
cdata2mat5 <- dcast(cdata2, Orig ~ Dest, sum, value.var = "doubsimFitted",
margins=c("Orig", "Dest"))
cdata2mat5[1:10,1:10]
The double constraint results in a marked improvement in the model fit, and a drop in the RMSE.
sqrt(mean((cdata2$Total-cdata2$doubsimFitted)^2))
## [1] 1136.545